164 research outputs found
Pairwise gene GO-based measures for biclustering of high-dimensional expression data
Background: Biclustering algorithms search for groups of genes that share the same
behavior under a subset of samples in gene expression data. Nowadays, the biological
knowledge available in public repositories can be used to drive these algorithms to
find biclusters composed of functionally coherent groups of genes. On the other hand,
a distance among genes can be defined according to their information stored in Gene
Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each
pair of genes which establishes their functional similarity. A scatter search-based
algorithm that optimizes a merit function that integrates GO information is studied in
this paper. This merit function uses a term that addresses the information through a GO
measure.
Results: The effect of two possible different gene pairwise GO measures on the
performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with
approximately one thousand genes are studied. Secondly, a group of human
datasets related to clinical data of cancer is also explored by the algorithm. Most of
these data are high-dimensional datasets composed of a huge number of genes. The
resultant biclusters reveal groups of genes linked by the same functionality when the
search procedure is driven by one of the proposed GO measures. Furthermore, a
qualitative biological study of a group of biclusters shows their relevance from a cancer
disease perspective.
Conclusions: It can be concluded that the integration of biological information
improves the performance of the biclustering process. The two different GO measures
studied show an improvement in the results obtained for the yeast dataset. However, if
datasets are composed of a huge number of genes, only one of them really improves
the algorithm performance. This second case constitutes a clear option to explore
interesting datasets from a clinical point of view.
Ministerio de Economía y Competitividad TIN2014-55894-C2-
Evolutionary Metaheuristic for Biclustering based on Linear Correlations among Genes
A new measure to evaluate the quality of a bicluster is proposed in
this paper. This measure is based on correlations among genes.
Moreover, a new evolutionary metaheuristic based on Scatter
Search, which uses this measure as the fitness function, is presented
to obtain biclusters that contain groups of highly correlated genes.
Later, an analysis of the correlation matrix of these biclusters is
made to select these groups of genes that define new biclusters with
shifting and scaling patterns. Experimental results from human B cell lymphoma are presented.
Ministerio de Ciencia e Innovación TIN2007-68084-C02
Junta de Andalucía P07-TIC-0261
Biclustering of Gene Expression Data by Correlation-Based Scatter Search
BACKGROUND: The analysis of data generated by microarray technology is very useful for understanding how genetic information becomes functional gene products. Biclustering algorithms can determine a group of genes that are co-expressed under a set of experimental conditions. Recently, new biclustering methods based on metaheuristics have been proposed. Most of them use the Mean Squared Residue as the merit function, but patterns that are interesting and relevant from a biological point of view, such as shifting and scaling patterns, may not be detected using this measure. It is important to discover this type of pattern, since genes commonly present similar behavior although their expression levels vary in different ranges or magnitudes.
METHODS: Scatter Search is an evolutionary technique based on the evolution of a small set of solutions chosen according to quality and diversity criteria. This paper presents a Scatter Search with the aim of finding biclusters from gene expression data. In this algorithm, the proposed fitness function is based on the linear correlation among genes in order to detect shifting and scaling patterns, and an improvement method is included in order to select only positively correlated genes.
RESULTS: The proposed algorithm has been tested with three real datasets, namely the Yeast Cell Cycle dataset, the human B-cell lymphoma dataset and the Yeast Stress dataset, finding a remarkable number of biclusters with shifting and scaling patterns. In addition, the performance of the proposed method and fitness function is compared to that of CC, OPSM, ISA, BiMax, xMotifs and Samba using the Gene Ontology database.
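The contrast drawn in this abstract between the Mean Squared Residue and a correlation-based measure can be illustrated with a small sketch. The code below is not the fitness function of the paper, whose exact form is not given here; it assumes the usual Cheng and Church definition of the Mean Squared Residue and uses the average pairwise Pearson correlation among rows as a stand-in for a correlation-based measure. The function names and toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def mean_squared_residue(rows):
    """Mean Squared Residue of a bicluster given as a list of gene rows."""
    n, m = len(rows), len(rows[0])
    row_means = [sum(r) / m for r in rows]
    col_means = [sum(r[j] for r in rows) / n for j in range(m)]
    total_mean = sum(row_means) / n
    return sum(
        (rows[i][j] - row_means[i] - col_means[j] + total_mean) ** 2
        for i in range(n)
        for j in range(m)
    ) / (n * m)


def pearson(x, y):
    """Pearson linear correlation between two expression profiles."""
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def mean_pairwise_correlation(rows):
    """Average Pearson correlation over all pairs of genes (illustrative stand-in)."""
    pairs = list(combinations(rows, 2))
    return sum(pearson(x, y) for x, y in pairs) / len(pairs)


# Toy data (hypothetical): one base profile, then shifted and scaled copies.
base = [1.0, 3.0, 2.0, 5.0]
shifted = [[v + s for v in base] for s in (0.0, 2.0, 10.0)]
scaled = [[v * k for v in base] for k in (1.0, 2.0, 5.0)]

print("shifted:", mean_squared_residue(shifted), mean_pairwise_correlation(shifted))
print("scaled: ", mean_squared_residue(scaled), mean_pairwise_correlation(scaled))
```

On this toy data the shifted bicluster has a residue of zero, while the scaled bicluster has a clearly positive residue even though its genes follow the same profile. The pairwise correlation is close to 1 in both cases, which is the behavior the abstract attributes to a correlation-based measure.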
Correlation-Based Scatter Search for Discovering Biclusters from Gene Expression Data
Scatter Search is an evolutionary method that combines existing solutions to create new offspring, as the well-known genetic algorithms do. This paper presents a Scatter Search with the aim of finding
biclusters from gene expression data. However, biclusters with certain
patterns are more interesting from a biological point of view. Therefore,
the proposed Scatter Search uses a measure based on linear correlations
among genes to evaluate the quality of biclusters. As is usual in Scatter
Search methodology, an improvement method is included which avoids
finding biclusters with negatively correlated genes. Experimental results
from yeast cell cycle and human B-cell lymphoma datasets are reported,
showing a remarkable performance of the proposed method and measure.
Ministerio de Ciencia y Tecnología TIN2007-68084-C00
Junta de Andalucía P07-TIC-0261
Databases Reduction Simultaneously by Ordered Projection
In this paper, a new algorithm, Database Reduction Simultaneously by Ordered Projections (RESOP), is introduced. This algorithm
reduces databases in two directions: editing examples and feature selection simultaneously. Ordered projection techniques have been used
to design RESOP, taking advantage of symmetrical ideas for two different tasks. Experiments have been carried out with UCI Repository
databases, and the performance of the subsequent application of classification
techniques has been satisfactory.
Effect of ACTN3 R577X Genotype on Injury Epidemiology in Elite Endurance Runners
The p.R577X polymorphism (rs1815739) in the ACTN3 gene causes individuals with the ACTN3 XX genotype to be deficient in functional α-actinin-3. Previous investigations have found that XX athletes are more prone to suffer non-contact muscle injuries. This investigation aimed to determine the influence of the ACTN3 R577X polymorphism in the injury epidemiology of elite endurance athletes. Using a cross-sectional experiment, the epidemiology of running-related injuries was recorded for one season in a group of 89 Spanish elite endurance runners. ACTN3 R577X genotype was obtained for each athlete using genomic DNA samples. From the study sample, 42.7% of athletes had the RR genotype, 39.3% had the RX genotype, and 18.0% had the XX genotype. A total of 96 injuries were recorded in 57 athletes. Injury incidence was higher in RR runners (3.2 injuries/1000 h of running) than in RX (2.0 injuries/1000 h) and XX (2.2 injuries/1000 h; p = 0.030) runners. RR runners had a higher proportion of injuries located in the Achilles tendon, RX runners had a higher proportion of injuries located in the knee, and XX runners had a higher proportion of injuries located in the groin (p = 0.025). The ACTN3 genotype did not affect the mode of onset, the severity, or the type of injury. The ACTN3 genotype slightly affected the injury epidemiology of elite endurance athletes with a higher injury rate in RR athletes and differences in injury location. However, elite ACTN3 XX endurance runners were not more prone to muscle-type injuries.
Biclustering of Gene Expression Data Based on SimUI Semantic Similarity Measure
Biclustering is an unsupervised machine learning technique
that simultaneously clusters genes and conditions in gene expression
data. Gene Ontology (GO) is usually used in this context to validate
the biological relevance of the results. However, although the integration
of biological information from different sources is one of the research
directions in Bioinformatics, GO is not used in biclustering as input
data. A scatter search-based algorithm that integrates GO information
during the biclustering search process is presented in this paper. SimUI
is a GO semantic similarity measure that defines a distance between two
genes. The algorithm optimizes a fitness function that uses SimUI to
integrate the biological information stored in GO. Experimental results
analyze the effect of integration of the biological information through
this measure. A SimUI fitness function configuration is experimentally
studied in a scatter search-based biclustering algorithm.
Ministerio de Ciencia e Innovación TIN2011-28956-C02-02
Ministerio de Ciencia e Innovación TIN2014-55894-C2-R
Junta de Andalucía P12-TIC-1728
Universidad Pablo de Olavide APPB81309
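As a rough illustration of the kind of measure this abstract refers to, the sketch below computes a union-intersection similarity between two genes. It assumes the common definition of SimUI, namely the number of GO terms shared by the two genes' induced graphs (annotated terms plus all their ancestors) divided by the number of terms in the union of those graphs. The toy ontology, term names and function names are hypothetical, and how the paper turns this value into a term of its fitness function is not shown here.

```python
def induced_terms(terms, parents):
    """Return the annotated terms together with all of their ancestors."""
    seen = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in seen:
            seen.add(term)
            stack.extend(parents.get(term, ()))
    return seen


def sim_ui(terms_a, terms_b, parents):
    """Union-intersection similarity between two genes' GO annotations."""
    graph_a = induced_terms(terms_a, parents)
    graph_b = induced_terms(terms_b, parents)
    union = graph_a | graph_b
    if not union:
        return 0.0  # neither gene is annotated
    return len(graph_a & graph_b) / len(union)


# Toy ontology (hypothetical): each term maps to its direct parents.
parents = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}

print(sim_ui({"A1"}, {"A2"}, parents))  # 0.5: shares "A" and "root" out of 4 terms
print(sim_ui({"A1"}, {"B1"}, parents))  # 0.2: shares only "root" out of 5 terms
```

One possible way to obtain a distance from this similarity is to take 1 minus its value, although that choice is an assumption rather than something stated in the abstract.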
A Measure for Data Set Editing by Ordered Projections
In this paper we study a measure, named weakness of an
example, which allows us to establish the importance of an example to
find representative patterns for the data set editing problem. Our approach consists of reducing the database size without losing information,
using algorithm patterns by ordered projections. The idea is to relax the
reduction factor with a new parameter, λ, removing all examples of the
database whose weakness verifies a condition over this λ. We study how
to establish this new parameter. Our experiments have been carried out
using all databases from the UCI Repository, and they show that a size
reduction is possible in complex databases without a notable increase in the
error rate.
A Clustering-Based Hybrid Support Vector Regression Model to Predict Container Volume at Seaport Sanitary Facilities
An accurate prediction of freight volume at the sanitary facilities of seaports is a key factor in improving planning operations and resource allocation. This study proposes a hybrid approach to forecast container volume at the sanitary facilities of a seaport. The methodology consists of a three-step procedure, combining the strengths of linear and non-linear models and the capability of a clustering technique. First, a self-organizing map (SOM) is used to decompose the time series into smaller clusters that are easier to predict. Second, a seasonal autoregressive integrated moving average (SARIMA) model is applied to each cluster in order to obtain its predicted values and residuals. These values are finally used as inputs of a support vector regression (SVR) model together with the historical data of the cluster. The final prediction integrates the prediction results of each cluster. The experimental results showed that the proposed model provided accurate predictions and outperformed the rest of the models tested. The proposed model can be used as an automatic decision-making tool by seaport management due to its capacity to plan resources in advance, avoiding congestion and time delays.
Artificial Neural Networks, Sequence-to-Sequence LSTMs, and Exogenous Variables as Analytical Tools for NO2 (Air Pollution) Forecasting: A Case Study in the Bay of Algeciras (Spain)
This study aims to produce accurate predictions of the NO2 concentrations at a specific station of a monitoring network located in the Bay of Algeciras (Spain). Artificial neural networks (ANNs) and sequence-to-sequence long short-term memory networks (LSTMs) were used to create the forecasting models. Additionally, a new prediction method was proposed combining LSTMs using a rolling window scheme with a cross-validation procedure for time series (LSTM-CVT). Two different strategies were followed regarding the input variables: using NO2 from the station or employing NO2 and other pollutants data from any station of the network plus meteorological variables. The ANN and LSTM-CVT exogenous models used lagged datasets of different window sizes. Several feature ranking methods were used to select the top lagged variables and include them in the final exogenous datasets. Prediction horizons of t + 1, t + 4 and t + 8 were employed. The inclusion of exogenous variables enhanced the models' performance, especially for t + 4 (ρ ≈ 0.68 to ρ ≈ 0.74) and t + 8 (ρ ≈ 0.59 to ρ ≈ 0.66). The proposed LSTM-CVT method delivered promising results, as the best performing models for each prediction horizon employed this new methodology. Additionally, for each parameter combination, it obtained lower error values than ANNs in 85% of the cases.
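The lagged datasets and the rolling window scheme mentioned in this abstract can be sketched generically as follows. The details of the LSTM-CVT procedure are not given here, so this is only an assumption-laden illustration: it builds (window, target) pairs for a chosen prediction horizon and generates rolling train/test index splits that never train on data later than the test block. The function names and parameters are hypothetical.

```python
def make_lagged(series, window, horizon):
    """Pair each window of past values with the value `horizon` steps ahead."""
    samples = []
    for t in range(window, len(series) - horizon + 1):
        past = series[t - window:t]        # last `window` observed values
        target = series[t - 1 + horizon]   # value `horizon` steps after the last one
        samples.append((past, target))
    return samples


def rolling_window_splits(n_samples, train_size, test_size):
    """Fixed-size training window that rolls forward by one test block at a time."""
    splits = []
    start = 0
    while start + train_size + test_size <= n_samples:
        train_idx = list(range(start, start + train_size))
        test_idx = list(range(start + train_size, start + train_size + test_size))
        splits.append((train_idx, test_idx))
        start += test_size
    return splits


# Toy series (hypothetical), standing in for hourly NO2 readings.
series = list(range(10))

for past, target in make_lagged(series, window=3, horizon=4):
    print(past, "->", target)

for train_idx, test_idx in rolling_window_splits(10, train_size=4, test_size=2):
    print("train", train_idx, "test", test_idx)
```

Each split keeps the temporal order intact, which is the main difference from ordinary shuffled cross-validation. A model would be fitted on the training indices of each split and evaluated on the block that follows.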